In this work, I will consider a dataset about cancer survival in U.S. counties:

Exploratory Data Analysis

cancer.dataset <- read.csv("cancer_reg.csv") %>% mutate(Geography = as.factor(Geography), binnedInc = as.factor(binnedInc))
cancer.dataset %>% skim()
Data summary
Name Piped data
Number of rows 3047
Number of columns 34
_______________________
Column type frequency:
factor 2
numeric 32
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
binnedInc 0 1 FALSE 10 (45: 306, (54: 306, [22: 306, (42: 305
Geography 0 1 FALSE 3047 Abb: 1, Aca: 1, Acc: 1, Ada: 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
avgAnnCount 0 1.00 606.34 1416.36 6.00 76.00 171.00 518.00 38150.00 ▇▁▁▁▁
avgDeathsPerYear 0 1.00 185.97 504.13 3.00 28.00 61.00 149.00 14010.00 ▇▁▁▁▁
TARGET_deathRate 0 1.00 178.66 27.75 59.70 161.20 178.10 195.20 362.80 ▁▇▆▁▁
incidenceRate 0 1.00 448.27 54.56 201.30 420.30 453.55 480.85 1206.90 ▂▇▁▁▁
medIncome 0 1.00 47063.28 12040.09 22640.00 38882.50 45207.00 52492.00 125635.00 ▇▇▁▁▁
popEst2015 0 1.00 102637.37 329059.22 827.00 11684.00 26643.00 68671.00 10170292.00 ▇▁▁▁▁
povertyPercent 0 1.00 16.88 6.41 3.20 12.15 15.90 20.40 47.40 ▃▇▃▁▁
studyPerCap 0 1.00 155.40 529.63 0.00 0.00 0.00 83.65 9762.31 ▇▁▁▁▁
MedianAge 0 1.00 45.27 45.30 22.30 37.70 41.00 44.00 624.00 ▇▁▁▁▁
MedianAgeMale 0 1.00 39.57 5.23 22.40 36.35 39.60 42.50 64.70 ▁▇▇▁▁
MedianAgeFemale 0 1.00 42.15 5.29 22.30 39.10 42.40 45.30 65.70 ▁▃▇▂▁
AvgHouseholdSize 0 1.00 2.48 0.43 0.02 2.37 2.50 2.63 3.97 ▁▁▃▇▁
PercentMarried 0 1.00 51.77 6.90 23.10 47.75 52.40 56.40 72.50 ▁▂▇▇▁
PctNoHS18_24 0 1.00 18.22 8.09 0.00 12.80 17.10 22.70 64.10 ▃▇▂▁▁
PctHS18_24 0 1.00 35.00 9.07 0.00 29.20 34.70 40.70 72.50 ▁▃▇▂▁
PctSomeCol18_24 2285 0.25 40.98 11.12 7.10 34.00 40.40 46.40 79.00 ▁▅▇▂▁
PctBachDeg18_24 0 1.00 6.16 4.53 0.00 3.10 5.40 8.20 51.80 ▇▁▁▁▁
PctHS25_Over 0 1.00 34.80 7.03 7.50 30.40 35.30 39.65 54.80 ▁▂▇▇▁
PctBachDeg25_Over 0 1.00 13.28 5.39 2.50 9.40 12.30 16.10 42.20 ▆▇▂▁▁
PctEmployed16_Over 152 0.95 54.15 8.32 17.60 48.60 54.50 60.30 80.10 ▁▂▇▇▁
PctUnemployed16_Over 0 1.00 7.85 3.45 0.40 5.50 7.60 9.70 29.40 ▅▇▁▁▁
PctPrivateCoverage 0 1.00 64.35 10.65 22.30 57.20 65.10 72.10 92.30 ▁▂▇▇▂
PctPrivateCoverageAlone 609 0.80 48.45 10.08 15.70 41.00 48.70 55.60 78.90 ▁▅▇▅▁
PctEmpPrivCoverage 0 1.00 41.20 9.45 13.50 34.50 41.10 47.70 70.70 ▁▅▇▃▁
PctPublicCoverage 0 1.00 36.25 7.84 11.20 30.90 36.30 41.55 65.10 ▁▅▇▂▁
PctPublicCoverageAlone 0 1.00 19.24 6.11 2.60 14.85 18.80 23.10 46.60 ▂▇▆▁▁
PctWhite 0 1.00 83.65 16.38 10.20 77.30 90.06 95.45 100.00 ▁▁▁▂▇
PctBlack 0 1.00 9.11 14.53 0.00 0.62 2.25 10.51 85.95 ▇▁▁▁▁
PctAsian 0 1.00 1.25 2.61 0.00 0.25 0.55 1.22 42.62 ▇▁▁▁▁
PctOtherRace 0 1.00 1.98 3.52 0.00 0.30 0.83 2.18 41.93 ▇▁▁▁▁
PctMarriedHouseholds 0 1.00 51.24 6.57 22.99 47.76 51.67 55.40 78.08 ▁▂▇▂▁
BirthRate 0 1.00 5.64 1.99 0.00 4.52 5.38 6.49 21.33 ▂▇▁▁▁

Features

These descriptions were directly taken from the dataset’s website:

  • TARGET_deathRate: Dependent variable. Mean per capita (100,000) cancer mortalities(a)
  • avgAnnCount: Mean number of reported cases of cancer diagnosed annually(a)
  • avgDeathsPerYear: Mean number of reported mortalities due to cancer(a)
  • incidenceRate: Mean per capita (100,000) cancer diagnoses(a)
  • medIncome: Median income per county (b)
  • popEst2015: Population of county (b)
  • povertyPercent: Percent of populace in poverty (b)
  • studyPerCap: Per capita number of cancer-related clinical trials per county (a)
  • binnedInc: Median income per capita binned by decile (b)
  • MedianAge: Median age of county residents (b)
  • MedianAgeMale: Median age of male county residents (b)
  • MedianAgeFemale: Median age of female county residents (b)
  • Geography: County name (b)
  • AvgHouseholdSize: Mean household size of county (b)
  • PercentMarried: Percent of county residents who are married (b)
  • PctNoHS18_24: Percent of county residents ages 18-24 highest education attained: less than high school (b)
  • PctHS18_24: Percent of county residents ages 18-24 highest education attained: high school diploma (b)
  • PctSomeCol18_24: Percent of county residents ages 18-24 highest education attained: some college (b)
  • PctBachDeg18_24: Percent of county residents ages 18-24 highest education attained: bachelor’s degree (b)
  • PctHS25_Over: Percent of county residents ages 25 and over highest education attained: high school diploma (b)
  • PctBachDeg25_Over: Percent of county residents ages 25 and over highest education attained: bachelor’s degree (b)
  • PctEmployed16_Over: Percent of county residents ages 16 and over employed (b)
  • PctUnemployed16_Over: Percent of county residents ages 16 and over unemployed (b)
  • PctPrivateCoverage: Percent of county residents with private health coverage (b)
  • PctPrivateCoverageAlone: Percent of county residents with private health coverage alone (no public assistance) (b)
  • PctEmpPrivCoverage: Percent of county residents with employee-provided private health coverage (b)
  • PctPublicCoverage: Percent of county residents with government-provided health coverage (b)
  • PctPublicCoverageAlone: Percent of county residents with government-provided health coverage alone (b)
  • PctWhite: Percent of county residents who identify as White (b)
  • PctBlack: Percent of county residents who identify as Black (b)
  • PctAsian: Percent of county residents who identify as Asian (b)
  • PctOtherRace: Percent of county residents who identify in a category which is not White, Black, or Asian (b)
  • PctMarriedHouseholds: Percent of married households (b)
  • BirthRate: Number of live births relative to number of women in county (b)

Additional notes on features

Geography names the county each observation refers to, so it acts as an identifier. The variables fall mainly into two categories: economic and demographic. The first includes variables about employment, education, household size, health insurance, and so on. The second covers residents' age, the number of cancer diagnoses, ethnicity (we will check for dependencies between income and ethnicity), marriage rates, and similar.

We can prepare some scatterplots to see if there are dependencies between these variables:

  • Economic features
pairs(cancer.dataset %>% select(
  medIncome,
  povertyPercent,
  AvgHouseholdSize,
  PctPrivateCoverage,
  PctPrivateCoverageAlone,
  PctEmpPrivCoverage,
  PctPublicCoverage,
  PctPublicCoverageAlone
))

  • Demographic features:

pairs(cancer.dataset %>% select(
  PctHS18_24,
  PctSomeCol18_24,
  binnedInc,
  PctBachDeg18_24,
  PctHS25_Over,
  PctBachDeg25_Over,
  PctEmployed16_Over,
  PctUnemployed16_Over
))

  • Ethnicity and income
pairs(cancer.dataset %>% select(
  PctWhite,
  PctBlack,
  PctAsian,
  PctOtherRace,
  medIncome
))

> This plot shows a correlation between medIncome and the percentage of the population belonging to each ethnic group.

  • A selection of features that, at first sight, seem correlated with cancer mortality:
pairs(cancer.dataset %>% select(avgAnnCount, avgDeathsPerYear, incidenceRate, medIncome, popEst2015, povertyPercent))

Missing values

Three columns have NA values in quite different proportions:

aggr(cancer.dataset)

We can impute values for them using packages such as mice:
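The log below could be produced by a call along these lines (a sketch, not necessarily the exact code used: the 5 imputations and 5 iterations match the printout, but the seed and the exclusion of the identifier column are assumptions):

```r
library(mice)
library(dplyr)

# impute the three incomplete columns with mice's default method for numeric
# data (predictive mean matching); Geography is excluded because a
# 3047-level factor is unusable as a predictor
imp <- mice(cancer.dataset %>% select(-Geography), m = 5, maxit = 5, seed = 1)
df_filled <- complete(imp)  # first completed dataset, used as df_filled below
```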

## 
##  iter imp variable
##   1   1  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   1   2  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   1   3  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   1   4  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   1   5  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   2   1  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   2   2  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   2   3  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   2   4  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   2   5  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   3   1  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   3   2  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   3   3  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   3   4  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   3   5  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   4   1  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   4   2  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   4   3  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   4   4  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   4   5  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   5   1  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   5   2  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   5   3  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   5   4  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone
##   5   5  PctSomeCol18_24  PctEmployed16_Over  PctPrivateCoverageAlone

# df_filled: the dataset completed by the mice imputation above
df <- cancer.dataset %>% mutate(
  PctSomeCol18_24 = df_filled$PctSomeCol18_24,
  PctEmployed16_Over = df_filled$PctEmployed16_Over,
  PctPrivateCoverageAlone = df_filled$PctPrivateCoverageAlone
)
cormat <- round(cor(df %>% select(-Geography, -binnedInc)), 2)
#no NA removal is needed as we imputed the missing data
#cormat <- cormat %>% apply(1:2, function(x){ifelse(is.na(x),0,x)})
ggcorrplot(cormat, hc.order = TRUE, outline.col = "white", tl.cex = 5.75)

Since linear regression was part of this course, we will instead model the missing values manually with this technique.

Linear regression models for NA imputation

The variables we have to impute are: PctSomeCol18_24, PctEmployed16_Over, and PctPrivateCoverageAlone.
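These counts can be verified directly against the skim table above:

```r
# NA counts per column; the skim output above reports 2285, 152 and 609
colSums(is.na(cancer.dataset[, c("PctSomeCol18_24",
                                 "PctEmployed16_Over",
                                 "PctPrivateCoverageAlone")]))
```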

PctSomeCol18_24

First, we remove the two factor variables and the other two columns with NAs, since we will not use them for the imputation, and keep only the rows where PctSomeCol18_24 is observed. We also remove the response variable to avoid creating artificial relationships between the imputed variables and the response.

df <- cancer.dataset %>% select(
  - Geography,
  - binnedInc,
  - PctEmployed16_Over,
  - PctPrivateCoverageAlone,
  - TARGET_deathRate
) %>% filter(
  !is.na(PctSomeCol18_24)
)
nrow(df)
## [1] 762
mod.full <- lm(PctSomeCol18_24 ~ ., data = df)
summary(mod.full)
## 
## Call:
## lm(formula = PctSomeCol18_24 ~ ., data = df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.116527 -0.010921  0.001025  0.012271  0.122172 
## 
## Coefficients:
##                          Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)             1.000e+02  8.096e-02  1235.419   <2e-16 ***
## avgAnnCount            -1.462e-06  4.309e-06    -0.339    0.734    
## avgDeathsPerYear       -1.544e-05  2.331e-05    -0.662    0.508    
## incidenceRate           5.602e-05  4.212e-05     1.330    0.184    
## medIncome               1.202e-07  5.160e-07     0.233    0.816    
## popEst2015              3.989e-08  3.009e-08     1.326    0.185    
## povertyPercent          1.032e-04  9.196e-04     0.112    0.911    
## studyPerCap             2.638e-06  6.270e-06     0.421    0.674    
## MedianAge               4.732e-05  4.565e-05     1.037    0.300    
## MedianAgeMale           1.857e-03  1.195e-03     1.554    0.121    
## MedianAgeFemale        -1.852e-03  1.240e-03    -1.493    0.136    
## AvgHouseholdSize        2.526e-03  5.389e-03     0.469    0.639    
## PercentMarried         -7.292e-04  9.074e-04    -0.804    0.422    
## PctNoHS18_24           -9.993e-01  3.394e-04 -2944.706   <2e-16 ***
## PctHS18_24             -1.000e+00  2.833e-04 -3530.419   <2e-16 ***
## PctBachDeg18_24        -1.001e+00  6.815e-04 -1468.177   <2e-16 ***
## PctHS25_Over           -6.381e-04  5.809e-04    -1.098    0.272    
## PctBachDeg25_Over       2.105e-04  9.077e-04     0.232    0.817    
## PctUnemployed16_Over   -1.211e-03  9.749e-04    -1.242    0.215    
## PctPrivateCoverage      2.560e-04  7.917e-04     0.323    0.746    
## PctEmpPrivCoverage     -5.603e-04  5.819e-04    -0.963    0.336    
## PctPublicCoverage      -1.482e-03  1.319e-03    -1.123    0.262    
## PctPublicCoverageAlone  2.212e-03  1.717e-03     1.289    0.198    
## PctWhite                3.156e-04  3.331e-04     0.948    0.344    
## PctBlack               -1.050e-04  3.177e-04    -0.331    0.741    
## PctAsian               -1.120e-03  1.206e-03    -0.928    0.353    
## PctOtherRace           -9.318e-04  7.828e-04    -1.190    0.234    
## PctMarriedHouseholds   -2.541e-04  8.537e-04    -0.298    0.766    
## BirthRate               1.197e-03  1.211e-03     0.989    0.323    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05631 on 733 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.059e+06 on 28 and 733 DF,  p-value: < 2.2e-16

We see that only the other education-share variables (PctNoHS18_24, PctHS18_24, PctBachDeg18_24) are significant, each with a coefficient close to -1 and an intercept close to 100, reflecting that the four education shares for ages 18-24 sum to roughly 100%. Still, it is useful to apply feature-selection strategies to reduce the model's complexity.

We apply forward and backward feature selection:

mod.fwd <- regsubsets(PctSomeCol18_24 ~ ., data = df, method = "forward")
mod.fwd.summary <- summary(mod.fwd)
print("BIC")
## [1] "BIC"
print(mod.fwd.summary$bic)
## [1]  -637.3475 -1699.9792 -8024.3332 -8025.5098 -8022.2539 -8018.0429 -8012.6549
## [8] -8007.9224
print("R²-adj")
## [1] "R²-adj"
mod.fwd.summary$adjr2
## [1] 0.5736592 0.8950675 0.9999741 0.9999743 0.9999744 0.9999745 0.9999745
## [8] 0.9999745
plot(mod.fwd.summary$adjr2, main = "Forward [RED: best R², BLUE: best BIC]")
adjr2.max <- which.max(mod.fwd.summary$adjr2)
points(adjr2.max, mod.fwd.summary$adjr2[adjr2.max], 
        col = "red", cex = 2, pch = 20)
bic.min <- which.min(mod.fwd.summary$bic)
points(bic.min, mod.fwd.summary$adjr2[bic.min], 
        col = "blue", cex = 2, pch = 20)

mod.bkw <- regsubsets(PctSomeCol18_24 ~ ., data = df, method = "backward")
mod.bkw.summary <- summary(mod.bkw)
print("BIC")
## [1] "BIC"
print(mod.bkw.summary$bic)
## [1]  -637.3475 -1699.9792 -8024.3332 -8025.5098 -8022.2539 -8016.5556 -8012.0985
## [8] -8007.3559
print("R²-adj")
## [1] "R²-adj"
mod.bkw.summary$adjr2
## [1] 0.5736592 0.8950675 0.9999741 0.9999743 0.9999744 0.9999744 0.9999745
## [8] 0.9999745
plot(mod.bkw.summary$adjr2,  main = "Backward [RED: best R², BLUE: best BIC]")
adjr2.max <- which.max(mod.bkw.summary$adjr2)
points(adjr2.max, mod.bkw.summary$adjr2[adjr2.max], 
        col = "red", cex = 2, pch = 20)
bic.min <- which.min(mod.bkw.summary$bic)
points(bic.min, mod.bkw.summary$adjr2[bic.min], 
        col = "blue", cex = 2, pch = 20)

# use coef(mod.bkw, n) to extract the coefficients 

The four-variable model is best with respect to BIC in both approaches, with exactly the same score.

# helper function to compute predictions from a named coefficient vector
predict.from.coefficients <- function(df, coefs){
  as.matrix(df[, names(coefs)[-1]]) %*% coefs[-1] + coefs[1]
}
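As a sanity check of this helper, a toy example with made-up data (y = 1 + 2*x1 + 3*x2 exactly):

```r
# helper reproduced from above for self-containment
predict.from.coefficients <- function(df, coefs){
  as.matrix(df[, names(coefs)[-1]]) %*% coefs[-1] + coefs[1]
}

# two observations with exact responses 1 + 2*1 + 3*0 = 3 and 1 + 2*2 + 3*1 = 8
toy <- data.frame(x1 = c(1, 2), x2 = c(0, 1))
coefs <- c("(Intercept)" = 1, x1 = 2, x2 = 3)
predict.from.coefficients(toy, coefs)  # 3, 8
```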

We can then plot the squared error on the training data:

coefs <- coef(mod.bkw, 4)
print(coefs)
##     (Intercept)      popEst2015    PctNoHS18_24      PctHS18_24 PctBachDeg18_24 
##    1.000031e+02    1.262458e-08   -9.996401e-01   -1.000229e+00   -1.000717e+00
preds <- predict.from.coefficients(df, coefs)
ggplot(data.frame(squared_error = (preds-df$PctSomeCol18_24)^2)) +
  geom_density(aes(x = squared_error)) +
  labs(title="Squared error distribution for PctSomeCol18_24 prediction")

# sd for stochastic imputation
std.dev.bkw.imp <- sd(df$PctSomeCol18_24)

We can now use this model to impute the NAs:

imputation <- predict.from.coefficients(cancer.dataset, coef(mod.bkw, 4)) + rnorm(nrow(cancer.dataset), mean = 0, sd = std.dev.bkw.imp)
cancer.dataset.filled <- cancer.dataset %>% mutate(PctSomeCol18_24 = ifelse(is.na(PctSomeCol18_24), imputation, PctSomeCol18_24))
ggplot(cancer.dataset.filled) +
  geom_point(aes(x=PctSomeCol18_24, y=TARGET_deathRate)) +
  geom_point(data=cancer.dataset, aes(x=PctSomeCol18_24, y=TARGET_deathRate), color="red", alpha = 0.5) +
  labs(title = "College education (18-24) vs Death rate, red dots are the original values")

We can now move on to the other two variables.

PctEmployed16_Over

df <- cancer.dataset %>% select(
  - Geography,
  - binnedInc,
  - PctSomeCol18_24,
  - PctPrivateCoverageAlone,
  - TARGET_deathRate
) %>% filter(
  !is.na(PctEmployed16_Over)
)
nrow(df)
## [1] 2895
mod.full <- lm(PctEmployed16_Over ~ ., data = df)
summary(mod.full)
## 
## Call:
## lm(formula = PctEmployed16_Over ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0053  -1.7858   0.1298   1.8743  14.0818 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             8.059e+01  2.268e+00  35.538  < 2e-16 ***
## avgAnnCount             6.768e-04  1.309e-04   5.169 2.51e-07 ***
## avgDeathsPerYear       -2.502e-03  6.663e-04  -3.755 0.000177 ***
## incidenceRate          -2.703e-03  1.289e-03  -2.097 0.036050 *  
## medIncome               3.902e-07  1.379e-05   0.028 0.977425    
## popEst2015              9.946e-07  9.342e-07   1.065 0.287121    
## povertyPercent         -5.024e-01  2.612e-02 -19.237  < 2e-16 ***
## studyPerCap             2.440e-04  1.131e-04   2.158 0.031043 *  
## MedianAge               1.523e-03  1.361e-03   1.119 0.263244    
## MedianAgeMale           2.569e-02  3.567e-02   0.720 0.471442    
## MedianAgeFemale        -2.894e-01  3.674e-02  -7.878 4.70e-15 ***
## AvgHouseholdSize        5.816e-01  1.647e-01   3.531 0.000421 ***
## PercentMarried          7.976e-01  2.516e-02  31.694  < 2e-16 ***
## PctNoHS18_24           -3.355e-02  9.534e-03  -3.519 0.000439 ***
## PctHS18_24             -5.025e-02  8.366e-03  -6.007 2.12e-09 ***
## PctBachDeg18_24         2.400e-02  1.843e-02   1.303 0.192834    
## PctHS25_Over            6.649e-02  1.653e-02   4.024 5.88e-05 ***
## PctBachDeg25_Over       1.988e-01  2.661e-02   7.473 1.03e-13 ***
## PctUnemployed16_Over   -4.452e-01  2.735e-02 -16.275  < 2e-16 ***
## PctPrivateCoverage      4.696e-03  2.239e-02   0.210 0.833876    
## PctEmpPrivCoverage      1.188e-01  1.768e-02   6.718 2.21e-11 ***
## PctPublicCoverage      -5.779e-01  3.715e-02 -15.555  < 2e-16 ***
## PctPublicCoverageAlone  5.327e-01  4.695e-02  11.345  < 2e-16 ***
## PctWhite               -3.251e-02  9.910e-03  -3.280 0.001049 ** 
## PctBlack               -1.356e-02  9.677e-03  -1.402 0.161105    
## PctAsian               -1.285e-01  3.184e-02  -4.036 5.57e-05 ***
## PctOtherRace            4.046e-03  2.106e-02   0.192 0.847704    
## PctMarriedHouseholds   -7.755e-01  2.402e-02 -32.287  < 2e-16 ***
## BirthRate               1.038e-01  3.263e-02   3.180 0.001488 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.219 on 2866 degrees of freedom
## Multiple R-squared:  0.8516, Adjusted R-squared:  0.8501 
## F-statistic: 587.3 on 28 and 2866 DF,  p-value: < 2.2e-16

This time many more variables are significant:

mod.fwd <- regsubsets(PctEmployed16_Over ~ ., data = df, method = "forward", nvmax=20)
mod.fwd.summary <- summary(mod.fwd)
print("BIC")
## [1] "BIC"
print(mod.fwd.summary$bic)
##  [1] -2599.157 -3373.836 -3590.351 -3885.055 -4183.619 -4951.858 -5108.831
##  [8] -5206.403 -5264.775 -5290.649 -5307.317 -5315.604 -5321.236 -5328.088
## [15] -5327.320 -5329.351 -5332.527 -5329.289 -5326.009 -5321.594
print("R²-adj")
## [1] "R²-adj"
mod.fwd.summary$adjr2
##  [1] 0.5946360 0.6905533 0.7135429 0.7418904 0.7677427 0.8223045 0.8320876
##  [8] 0.8380428 0.8416572 0.8434434 0.8447163 0.8455323 0.8462030 0.8469353
## [15] 0.8472626 0.8477365 0.8482690 0.8484641 0.8486568 0.8487900
plot(mod.fwd.summary$adjr2, main = "Forward [RED: best R², BLUE: best BIC]")
adjr2.max <- which.max(mod.fwd.summary$adjr2)
points(adjr2.max, mod.fwd.summary$adjr2[adjr2.max],
        col = "red", cex = 2, pch = 20)
bic.min <- which.min(mod.fwd.summary$bic)
points(bic.min, mod.fwd.summary$adjr2[bic.min], 
        col = "blue", cex = 2, pch = 20)

mod.bkw <- regsubsets(PctEmployed16_Over ~ ., data = df, method = "backward", nvmax=20)
mod.bkw.summary <- summary(mod.bkw)
print("BIC")
## [1] "BIC"
print(mod.bkw.summary$bic)
##  [1] -2599.157 -2979.831 -3820.672 -4201.440 -4687.407 -4951.858 -5108.831
##  [8] -5198.908 -5294.366 -5288.172 -5313.537 -5327.879 -5335.826 -5344.923
## [15] -5343.785 -5346.442 -5350.331 -5354.096 -5351.775 -5348.049
print("R²-adj")
## [1] "R²-adj"
mod.bkw.summary$adjr2
##  [1] 0.5946360 0.6454376 0.7354498 0.7686117 0.8048388 0.8223045 0.8320876
##  [8] 0.8376230 0.8432675 0.8433093 0.8450495 0.8461858 0.8469761 0.8478228
## [15] 0.8481289 0.8486328 0.8491992 0.8497571 0.8499978 0.8501655
plot(mod.bkw.summary$adjr2, main = "Backward [RED: best R², BLUE: best BIC]")
adjr2.max <- which.max(mod.bkw.summary$adjr2)
points(adjr2.max, mod.bkw.summary$adjr2[adjr2.max], 
        col = "red", cex = 2, pch = 20)
bic.min <- which.min(mod.bkw.summary$bic)
points(bic.min, mod.bkw.summary$adjr2[bic.min], 
        col = "blue", cex = 2, pch = 20)

For this feature, we also test our models with cross-validation, using the caret package:

ten.folds <- trainControl(method = "cv", number = 10)
# leave.one.out <- trainControl(method = "LOOCV")
ten.folds.models <- list()
ten.folds.models[[21]] <- train(PctEmployed16_Over ~ ., data = df, method = "lm", trControl = ten.folds)
# leave.one.out.model <- train(PctEmployed16_Over ~ ., data = df, method = "lm", trControl = leave.one.out)

outcomes <- cancer.dataset %>% filter(!is.na(PctEmployed16_Over)) %>% pull(PctEmployed16_Over)
for(i in 1:20){
  if(i > 1)
    ten.folds.models[[i]] <- train(y = outcomes, x = as.matrix(df[, names(coef(mod.bkw, i))[-1]]), method = "lm", trControl = ten.folds)
  else 
    ten.folds.models[[i]] <- train(y = outcomes, x = df[, names(coef(mod.bkw, 1))[-1], drop=FALSE], method = "lm", trControl = ten.folds)
}
rsqrs <- 1:21
for(i in 1:21){
  rsqrs[i] <- ten.folds.models[[i]]$results$Rsquared
}
plot(1:21, rsqrs, main = "R² w.r.t. the number of variables used")
points(which.max(rsqrs), max(rsqrs), col = "red", cex = 2, pch = 20)

rmses <- 1:21
for(i in 1:21){
  rmses[i] <- ten.folds.models[[i]]$results$RMSE
}
plot(1:21, rmses, main = "RMSE w.r.t. the number of variables used")
points(which.min(rmses), min(rmses), col = "red", cex = 2, pch = 20)

bics <- 1:21
for(i in 1:21){
  bics[i] <- BIC(ten.folds.models[[i]]$finalModel)
}
plot(1:21, bics, main = "BIC w.r.t. the number of variables used")
points(which.min(bics), min(bics), col = "red", cex = 2, pch = 20)

Note that the rightmost point in the previous plots was computed using all the variables. By BIC, the best model after CV also uses 18 variables; this is reflected in the R-squared.

coefs <- ten.folds.models[[18]]$finalModel$coefficients
print(coefs)
##            (Intercept)            avgAnnCount       avgDeathsPerYear 
##          79.1608747576           0.0006948743          -0.0019189410 
##         povertyPercent        MedianAgeFemale       AvgHouseholdSize 
##          -0.5100425805          -0.2725039006           0.6493057891 
##         PercentMarried           PctNoHS18_24             PctHS18_24 
##           0.8025860025          -0.0325926174          -0.0516786056 
##           PctHS25_Over      PctBachDeg25_Over   PctUnemployed16_Over 
##           0.0636106665           0.2114370870          -0.4567802805 
##     PctEmpPrivCoverage      PctPublicCoverage PctPublicCoverageAlone 
##           0.1124319697          -0.5774574496           0.5300125134 
##               PctWhite               PctAsian   PctMarriedHouseholds 
##          -0.0208750381          -0.0993396784          -0.7759655151 
##              BirthRate 
##           0.1094950529
preds <- predict.from.coefficients(df, coefs)
ggplot(data.frame(squared_error = (preds-df$PctEmployed16_Over)^2)) +
  geom_density(aes(x = squared_error))

# sd for stochastic imputation
std.dev.bkw.imp <- sd(df$PctEmployed16_Over)

We can now impute the NAs:

imputation <- predict.from.coefficients(cancer.dataset, coefs) + rnorm(nrow(cancer.dataset), mean = 0, sd = std.dev.bkw.imp)
cancer.dataset.filled <- cancer.dataset.filled %>% mutate(PctEmployed16_Over = ifelse(is.na(PctEmployed16_Over), imputation, PctEmployed16_Over))
ggplot(cancer.dataset.filled) +
  geom_point(aes(x=PctEmployed16_Over, y=TARGET_deathRate)) +
  geom_point(data=cancer.dataset, aes(x=PctEmployed16_Over, y=TARGET_deathRate), color="red", alpha=0.5) +
  labs(title = "Employment (>16) vs Death rate, red dots are the original values")

PctPrivateCoverageAlone

Finally, we repeat the first procedure on this last variable:

df <- cancer.dataset %>% select(
  - Geography,
  - binnedInc,
  - PctEmployed16_Over,
  - PctSomeCol18_24,
  - TARGET_deathRate
) %>% filter(
  !is.na(PctPrivateCoverageAlone)
)
nrow(df)
## [1] 2438
mod.full <- lm(PctPrivateCoverageAlone ~ ., data = df)
summary(mod.full)
## 
## Call:
## lm(formula = PctPrivateCoverageAlone ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9186 -0.5239  0.0172  0.5529  6.5346 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.316e+00  8.535e-01   1.542 0.123267    
## avgAnnCount             7.877e-05  4.941e-05   1.594 0.111022    
## avgDeathsPerYear       -8.305e-04  2.581e-04  -3.218 0.001307 ** 
## incidenceRate          -3.400e-04  4.638e-04  -0.733 0.463529    
## medIncome              -9.107e-06  5.192e-06  -1.754 0.079540 .  
## popEst2015              1.063e-06  3.559e-07   2.987 0.002848 ** 
## povertyPercent          5.054e-02  9.972e-03   5.068 4.32e-07 ***
## studyPerCap            -3.557e-05  4.148e-05  -0.858 0.391199    
## MedianAge              -5.615e-04  5.155e-04  -1.089 0.276214    
## MedianAgeMale           1.817e-02  1.337e-02   1.359 0.174329    
## MedianAgeFemale         2.692e-02  1.395e-02   1.930 0.053719 .  
## AvgHouseholdSize       -7.919e-02  6.129e-02  -1.292 0.196412    
## PercentMarried          7.462e-03  9.698e-03   0.769 0.441738    
## PctNoHS18_24           -5.658e-04  3.637e-03  -0.156 0.876382    
## PctHS18_24             -5.174e-03  3.207e-03  -1.614 0.106762    
## PctBachDeg18_24        -5.447e-03  7.086e-03  -0.769 0.442121    
## PctHS25_Over           -3.167e-03  6.336e-03  -0.500 0.617254    
## PctBachDeg25_Over       4.307e-02  9.971e-03   4.320 1.63e-05 ***
## PctUnemployed16_Over   -2.988e-02  1.047e-02  -2.853 0.004365 ** 
## PctPrivateCoverage      7.217e-01  8.500e-03  84.902  < 2e-16 ***
## PctEmpPrivCoverage      2.181e-01  6.744e-03  32.342  < 2e-16 ***
## PctPublicCoverage      -7.103e-01  1.415e-02 -50.181  < 2e-16 ***
## PctPublicCoverageAlone  7.288e-01  1.786e-02  40.809  < 2e-16 ***
## PctWhite                1.543e-02  3.752e-03   4.111 4.07e-05 ***
## PctBlack                1.220e-02  3.603e-03   3.385 0.000722 ***
## PctAsian                1.843e-03  1.244e-02   0.148 0.882236    
## PctOtherRace           -2.181e-02  7.721e-03  -2.825 0.004767 ** 
## PctMarriedHouseholds   -1.894e-03  9.242e-03  -0.205 0.837630    
## BirthRate              -2.340e-02  1.219e-02  -1.919 0.055128 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.125 on 2409 degrees of freedom
## Multiple R-squared:  0.9877, Adjusted R-squared:  0.9875 
## F-statistic:  6900 on 28 and 2409 DF,  p-value: < 2.2e-16
mod.fwd <- regsubsets(PctPrivateCoverageAlone ~ ., data = df, method = "forward", nvmax = 20)
mod.fwd.summary <- summary(mod.fwd)
print("BIC")
## [1] "BIC"
print(mod.fwd.summary$bic)
##  [1]  -4985.132  -7236.487  -8839.152 -10457.029 -10482.244 -10523.640
##  [7] -10545.292 -10576.493 -10588.916 -10593.460 -10590.991 -10587.245
## [13] -10583.144 -10584.446 -10580.583 -10575.749 -10570.498 -10565.018
## [19] -10558.867 -10552.382
print("R²-adj")
## [1] "R²-adj"
mod.fwd.summary$adjr2
##  [1] 0.8713617 0.9490536 0.9736724 0.9864793 0.9866557 0.9869169 0.9870687
##  [8] 0.9872686 0.9873686 0.9874272 0.9874495 0.9874652 0.9874790 0.9875205
## [15] 0.9875355 0.9875455 0.9875534 0.9875601 0.9875633 0.9875649
plot(mod.fwd.summary$adjr2, main = "Forward [RED: best R², BLUE: best BIC]")
adjr2.max <- which.max(mod.fwd.summary$adjr2)
points(adjr2.max, mod.fwd.summary$adjr2[adjr2.max], 
        col = "red", cex = 2, pch = 20)
bic.min <- which.min(mod.fwd.summary$bic)
points(bic.min, mod.fwd.summary$adjr2[bic.min], 
        col = "blue", cex = 2, pch = 20)

mod.bkw <- regsubsets(PctPrivateCoverageAlone ~ ., data = df, method = "backward", nvmax = 20)
mod.bkw.summary <- summary(mod.bkw)
print("BIC")
## [1] "BIC"
print(mod.bkw.summary$bic)
##  [1]  -4985.132  -7031.479  -9535.493 -10457.029 -10482.244 -10523.640
##  [7] -10545.292 -10576.493 -10588.916 -10593.460 -10588.143 -10588.650
## [13] -10588.009 -10584.897 -10580.583 -10575.749 -10570.498 -10565.018
## [19] -10558.867 -10552.382
print("R²-adj")
## [1] "R²-adj"
mod.bkw.summary$adjr2
##  [1] 0.8713617 0.9445844 0.9802136 0.9864793 0.9866557 0.9869169 0.9870687
##  [8] 0.9872686 0.9873686 0.9874272 0.9874348 0.9874724 0.9875040 0.9875228
## [15] 0.9875355 0.9875455 0.9875534 0.9875601 0.9875633 0.9875649
plot(mod.bkw.summary$adjr2, main = "Backward [RED: best R², BLUE: best BIC]")
adjr2.max <- which.max(mod.bkw.summary$adjr2)
points(adjr2.max, mod.bkw.summary$adjr2[adjr2.max], 
        col = "red", cex = 2, pch = 20)
bic.min <- which.min(mod.bkw.summary$bic)
points(bic.min, mod.bkw.summary$adjr2[bic.min], 
        col = "blue", cex = 2, pch = 20)

# use coef(mod.bkw, n) to extract the coefficients 

The best model by BIC has 10 variables (we use the backward selection strategy, although the forward one gave equivalent results).

predict.from.coefficients <- function(df, coefs){
  as.matrix(df[, names(coefs)[-1]]) %*% coefs[-1] + coefs[1]
}
coefs <- coef(mod.bkw, 10)
print(coefs)
##            (Intercept)         povertyPercent        MedianAgeFemale 
##            -0.32445181             0.06280057             0.04163113 
##      PctBachDeg25_Over   PctUnemployed16_Over     PctPrivateCoverage 
##             0.04674681            -0.03462309             0.72354376 
##     PctEmpPrivCoverage      PctPublicCoverage PctPublicCoverageAlone 
##             0.21387879            -0.70265761             0.71934483 
##               PctWhite               PctBlack 
##             0.02059450             0.01491203
preds <- predict.from.coefficients(df, coefs)
ggplot(data.frame(squared_error = (preds-df$PctPrivateCoverageAlone)^2)) +
  geom_density(aes(x = squared_error))

# sd for stochastic imputation
std.dev.bkw.imp <- sd(df$PctPrivateCoverageAlone)
imputation <- predict.from.coefficients(cancer.dataset, coefs) + rnorm(nrow(cancer.dataset), mean = 0, sd = std.dev.bkw.imp)
cancer.dataset.filled <- cancer.dataset.filled %>% mutate(PctPrivateCoverageAlone = ifelse(is.na(PctPrivateCoverageAlone), imputation, PctPrivateCoverageAlone))
ggplot(cancer.dataset.filled) +
  geom_point(aes(x=PctPrivateCoverageAlone, y=TARGET_deathRate)) +
  geom_point(data=cancer.dataset, aes(x=PctPrivateCoverageAlone, y=TARGET_deathRate), color="red", alpha = 0.5) 

Checking imputation results

We have now prepared a dataset without missing values:

aggr(cancer.dataset.filled)

cormat <- round(cor(cancer.dataset.filled %>% select(-Geography, -binnedInc)), 2)
ggcorrplot(cormat, hc.order = TRUE, outline.col = "white", tl.cex = 5.5)

The correlation matrix looks very similar to the one obtained earlier with the mice package.
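If the earlier matrix is still available (here assumed saved as cormat.mice, a hypothetical name, before cormat was overwritten), the similarity claim can be quantified:

```r
# largest entry-wise difference between the two correlation matrices;
# cormat.mice is assumed to hold the matrix from the mice-based imputation
max(abs(cormat - cormat.mice))
```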

Creating a linear regression model for the TARGET variable

We now want to build a linear regression model for the cancer death rate. In doing so we are also interested in identifying the most relevant features in this context and, to this end, we will use the lasso methodology.
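At its core, the lasso shrinks coefficients toward zero and sets small ones exactly to zero through soft-thresholding. A minimal base-R illustration of that operator (illustrative only; it is the building block of, not a replacement for, the coordinate-descent algorithm glmnet runs):

```r
# soft-thresholding: shrink by lambda, clip at zero
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)
soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1)  # -2  0  0  1
```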

cancer.dataset.filled <- cancer.dataset.filled %>% select(-Geography, -binnedInc)

We first plot the summary of a full linear model and some helpful scatterplots:

mod.full <- lm(TARGET_deathRate ~ ., data = cancer.dataset.filled)
summary(mod.full)
## 
## Call:
## lm(formula = TARGET_deathRate ~ ., data = cancer.dataset.filled)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -99.149 -10.747  -0.399  10.571 134.204 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.624e+02  1.568e+01  10.359  < 2e-16 ***
## avgAnnCount             -3.248e-03  7.709e-04  -4.214 2.59e-05 ***
## avgDeathsPerYear         1.750e-02  3.886e-03   4.504 6.91e-06 ***
## incidenceRate            1.915e-01  7.204e-03  26.587  < 2e-16 ***
## medIncome                1.117e-04  8.010e-05   1.395 0.163128    
## popEst2015              -1.517e-05  5.453e-06  -2.783 0.005422 ** 
## povertyPercent           3.817e-01  1.605e-01   2.378 0.017467 *  
## studyPerCap              1.705e-05  6.745e-04   0.025 0.979829    
## MedianAge               -2.160e-03  7.795e-03  -0.277 0.781746    
## MedianAgeMale           -4.929e-01  2.078e-01  -2.372 0.017774 *  
## MedianAgeFemale         -1.123e-01  2.162e-01  -0.519 0.603692    
## AvgHouseholdSize         8.048e-01  9.557e-01   0.842 0.399823    
## PercentMarried           1.305e+00  1.649e-01   7.912 3.53e-15 ***
## PctNoHS18_24            -1.488e-01  6.717e-02  -2.215 0.026863 *  
## PctHS18_24               2.095e-01  6.240e-02   3.357 0.000797 ***
## PctSomeCol18_24         -6.971e-03  3.614e-02  -0.193 0.847055    
## PctBachDeg18_24         -9.587e-02  1.143e-01  -0.839 0.401691    
## PctHS25_Over             4.132e-01  9.628e-02   4.291 1.83e-05 ***
## PctBachDeg25_Over       -1.157e+00  1.541e-01  -7.511 7.68e-14 ***
## PctEmployed16_Over      -5.098e-01  9.479e-02  -5.378 8.10e-08 ***
## PctUnemployed16_Over     1.415e-01  1.653e-01   0.856 0.391903    
## PctPrivateCoverage      -4.476e-01  1.421e-01  -3.151 0.001643 ** 
## PctPrivateCoverageAlone -3.438e-02  8.010e-02  -0.429 0.667825    
## PctEmpPrivCoverage       3.836e-01  1.049e-01   3.657 0.000260 ***
## PctPublicCoverage       -2.826e-01  2.285e-01  -1.236 0.216424    
## PctPublicCoverageAlone   3.699e-01  2.827e-01   1.308 0.190946    
## PctWhite                -1.275e-01  5.711e-02  -2.232 0.025673 *  
## PctBlack                -4.774e-02  5.517e-02  -0.865 0.386969    
## PctAsian                -4.135e-02  1.879e-01  -0.220 0.825853    
## PctOtherRace            -8.903e-01  1.236e-01  -7.201 7.54e-13 ***
## PctMarriedHouseholds    -1.304e+00  1.581e-01  -8.246 2.41e-16 ***
## BirthRate               -8.799e-01  1.926e-01  -4.569 5.10e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.26 on 3015 degrees of freedom
## Multiple R-squared:  0.5234, Adjusted R-squared:  0.5185 
## F-statistic: 106.8 on 31 and 3015 DF,  p-value: < 2.2e-16
df <- cancer.dataset.filled %>% pivot_longer(!starts_with("TARGET"), values_to = "vals",  names_to = "vars")
for(c in colnames(cancer.dataset.filled)){
  if(c != "TARGET_deathRate"){
    p <- ggplot(df %>% filter(vars == !!c)) +
      geom_point(aes(x=vals, y=TARGET_deathRate), alpha = 0.5) +
      labs(x = c, title = paste(c,"vs death rate"))
    print(p)
  }
}

We notice that the variable MedianAge has some outliers with impossibly high values; those points likely need to be divided by 10. Similarly, some values of AvgHouseholdSize need to be multiplied by 100.

cancer.dataset.filled <- cancer.dataset.filled %>% mutate(MedianAge = ifelse(MedianAge>100, MedianAge/10, MedianAge))
cancer.dataset.filled <- cancer.dataset.filled %>% mutate(AvgHouseholdSize = ifelse(AvgHouseholdSize<1, AvgHouseholdSize*100, AvgHouseholdSize))
df <- cancer.dataset.filled %>% pivot_longer(!starts_with("TARGET"), values_to = "vals",  names_to = "vars")
ggplot(df %>% filter(vars == "MedianAge")) +
      geom_point(aes(x=vals, y=TARGET_deathRate), alpha = 0.5) +
      labs(x = "MedianAge", title = paste("MedianAge","vs death rate"))

ggplot(df %>% filter(vars == "AvgHouseholdSize")) +
      geom_point(aes(x=vals, y=TARGET_deathRate), alpha = 0.5) +
      labs(x = "AvgHouseholdSize", title = paste("AvgHouseholdSize","vs death rate"))

We compute the lasso model:

y <- cancer.dataset.filled$TARGET_deathRate
X <- cancer.dataset.filled %>% select(-TARGET_deathRate)
cv.out <- cv.glmnet(y = y, x = as.matrix(X), alpha = 1)
plot(cv.out)

best.lambda <- cv.out$lambda.min
best.lambda
## [1] 0.04210501
lasso.model <- glmnet(y = y, x = as.matrix(X), alpha = 1, lambda = best.lambda)

and these are the computed coefficients:

coef(lasso.model)
## 32 x 1 sparse Matrix of class "dgCMatrix"
##                                    s0
## (Intercept)              1.885567e+02
## avgAnnCount             -2.788806e-03
## avgDeathsPerYear         1.241877e-02
## incidenceRate            1.908705e-01
## medIncome                9.329769e-05
## popEst2015              -8.987387e-06
## povertyPercent           2.731418e-01
## studyPerCap              .           
## MedianAge               -5.917602e-03
## MedianAgeMale           -4.163005e-01
## MedianAgeFemale         -3.314116e-01
## AvgHouseholdSize        -8.180784e+00
## PercentMarried           8.632295e-01
## PctNoHS18_24            -1.159079e-01
## PctHS18_24               2.408217e-01
## PctSomeCol18_24          .           
## PctBachDeg18_24         -4.393079e-02
## PctHS25_Over             3.931600e-01
## PctBachDeg25_Over       -1.167440e+00
## PctEmployed16_Over      -3.951473e-01
## PctUnemployed16_Over     2.580653e-01
## PctPrivateCoverage      -5.814308e-01
## PctPrivateCoverageAlone -6.203856e-03
## PctEmpPrivCoverage       3.721590e-01
## PctPublicCoverage       -1.302624e-01
## PctPublicCoverageAlone   2.210668e-01
## PctWhite                -1.394536e-01
## PctBlack                -4.006803e-02
## PctAsian                 .           
## PctOtherRace            -8.534669e-01
## PctMarriedHouseholds    -8.026494e-01
## BirthRate               -8.129336e-01

As the lasso does not filter out many variables, we decide to pre-filter with backward feature selection:

mod.bkw <- regsubsets(TARGET_deathRate ~ ., data = cancer.dataset.filled, method = "backward", nvmax = 20)
mod.bkw.summary <- summary(mod.bkw)
print("BIC")
## [1] "BIC"
print(mod.bkw.summary$bic)
##  [1]  -802.9273 -1644.3033 -1829.8624 -1916.9769 -1961.9616 -1995.1886
##  [7] -2029.2523 -2039.9059 -2043.0725 -2071.0381 -2086.0472 -2084.0338
## [13] -2093.5109 -2099.3634 -2102.2901 -2102.0906 -2100.4461 -2096.0825
## [19] -2091.1504 -2086.2792
print("R²-adj")
## [1] "R²-adj"
mod.bkw.summary$adjr2
##  [1] 0.2354372 0.4212519 0.4566986 0.4732269 0.4821414 0.4889366 0.4957811
##  [8] 0.4986971 0.5003700 0.5060737 0.5096315 0.5104362 0.5130793 0.5151315
## [15] 0.5167113 0.5177913 0.5186407 0.5190595 0.5193882 0.5197263
plot(mod.bkw.summary$adjr2, xlab = "Number of variables", ylab = "Adjusted R²",
     main = "Backward [RED: best R², BLUE: best BIC]")
adjr2.max <- which.max(mod.bkw.summary$adjr2)
points(adjr2.max, mod.bkw.summary$adjr2[adjr2.max], 
        col = "red", cex = 2, pch = 20)
bic.min <- which.min(mod.bkw.summary$bic)
points(bic.min, mod.bkw.summary$adjr2[bic.min], 
        col = "blue", cex = 2, pch = 20)

and then apply the lasso technique again on the selected variables:

coef(mod.bkw, 15)
##          (Intercept)          avgAnnCount     avgDeathsPerYear 
##        185.347161100         -0.003515612          0.008580258 
##        incidenceRate        MedianAgeMale       PercentMarried 
##          0.196763938         -0.768055787          1.208040728 
##           PctHS18_24         PctHS25_Over    PctBachDeg25_Over 
##          0.249120176          0.364366022         -1.087705370 
##   PctEmployed16_Over   PctPrivateCoverage   PctEmpPrivCoverage 
##         -0.554605910         -0.661982733          0.497177431 
##             PctWhite         PctOtherRace PctMarriedHouseholds 
##         -0.103381638         -0.931121653         -1.285363044 
##            BirthRate 
##         -0.870231984
y <- cancer.dataset.filled$TARGET_deathRate
X <- cancer.dataset.filled %>% select(-TARGET_deathRate)
X <- X[, names(coef(mod.bkw, 15)[-1])]
cv.out <- cv.glmnet(y = y, x = as.matrix(X), alpha = 1)
plot(cv.out)

best.lambda <- cv.out$lambda.min
best.lambda
## [1] 0.01256265
lasso.model <- glmnet(y = y, x = as.matrix(X), alpha = 1, lambda = best.lambda)
coef(lasso.model)
## 16 x 1 sparse Matrix of class "dgCMatrix"
##                                 s0
## (Intercept)          184.585273409
## avgAnnCount           -0.003342230
## avgDeathsPerYear       0.008101041
## incidenceRate          0.196808007
## MedianAgeMale         -0.749289236
## PercentMarried         1.150298205
## PctHS18_24             0.249152239
## PctHS25_Over           0.363376116
## PctBachDeg25_Over     -1.090545980
## PctEmployed16_Over    -0.538866760
## PctPrivateCoverage    -0.655266381
## PctEmpPrivCoverage     0.480646968
## PctWhite              -0.100804637
## PctOtherRace          -0.928276429
## PctMarriedHouseholds  -1.243061786
## BirthRate             -0.861237859

We obtain the following values for MSE, RMSE and MAE, computed here on the whole (training) dataset:

y_est <- predict(lasso.model, as.matrix(X))
print("MSE:")
## [1] "MSE:"
print(sum((y-y_est)^2/nrow(X)))
## [1] 370.2728
print("RMSE:")
## [1] "RMSE:"
print(sqrt(sum((y-y_est)^2/nrow(X))))
## [1] 19.24247
print("MAE:")
## [1] "MAE:"
print(sum(abs(y-y_est)/nrow(X)))
## [1] 14.20949
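The same metrics can be written more compactly with mean(); a small self-contained sketch (the helper names are ours):

```r
# mean squared error, its square root, and mean absolute error
mse  <- function(y, yhat) mean((y - yhat)^2)
rmse <- function(y, yhat) sqrt(mse(y, yhat))
mae  <- function(y, yhat) mean(abs(y - yhat))
mse(c(1, 2, 3), c(1, 2, 5))   # 4/3
mae(c(1, 2, 3), c(1, 2, 5))   # 2/3
```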

Finally, we can get a better picture of the model's performance using cross-validation:
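One note on fold construction: drawing fold labels with runif() produces folds of random, slightly unequal sizes. A balanced base-R alternative (sketch with an illustrative sample size):

```r
k <- 10
n <- 100  # illustrative; in our case it would be nrow(cancer.dataset.filled)
# shuffle a vector that contains each fold label equally often
fold_group <- sample(rep(1:k, length.out = n))
table(fold_group)  # every fold has exactly n/k = 10 observations
```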

k <- 10
# assign each observation to one of the k folds
fold_group <- trunc(runif(nrow(cancer.dataset.filled), min = 1, max = k + 1))
error <- 0
error_abs <- 0
for(fold in 1:k){
  # --- TRAIN DATASET
  # - use backward method
  df_train <- cancer.dataset.filled %>% filter(fold_group != fold)
  y <- df_train$TARGET_deathRate
  X <- df_train %>% select(-TARGET_deathRate)
  mod.bkw <- regsubsets(TARGET_deathRate ~ ., data = df_train, method = "backward", nvmax = 20)
  mod.bkw.summary <- summary(mod.bkw)
  bic.min <- which.min(mod.bkw.summary$bic)
  # - use lasso
  # filter unused variables
  X <- X[, names(coef(mod.bkw, bic.min)[-1])]
  # compute lambda
  cv.out <- cv.glmnet(y = y, x = as.matrix(X), alpha = 1)
  best.lambda <- cv.out$lambda.min
  lasso.model <- glmnet(y = y, x = as.matrix(X), alpha = 1, lambda = best.lambda)
  # --- TEST DATASET
  df_test <- cancer.dataset.filled %>% filter(fold_group == fold)
  y <- df_test$TARGET_deathRate
  X <- df_test %>% select(-TARGET_deathRate)
  X <- X[, (coef(lasso.model)[,1] %>% names())[-1]] # used variables
  y_est <- predict(lasso.model, as.matrix(X))
  # --- COMPUTE METRICS
  error <- error + sum((y-y_est)^2)
  print(paste("MSE of fold ", fold, " :"))
  print(sum((y-y_est)^2)/nrow(X))
  error_abs <- error_abs + sum(abs(y - y_est))
  print(paste("MAE of fold ", fold, " :"))
  print(sum(abs(y-y_est))/nrow(X))
}
## [1] "MSE of fold  1  :"
## [1] 329.1637
## [1] "MAE of fold  1  :"
## [1] 13.97176
## [1] "MSE of fold  2  :"
## [1] 402.5424
## [1] "MAE of fold  2  :"
## [1] 15.01391
## [1] "MSE of fold  3  :"
## [1] 302.2367
## [1] "MAE of fold  3  :"
## [1] 13.20454
## [1] "MSE of fold  4  :"
## [1] 435.5085
## [1] "MAE of fold  4  :"
## [1] 15.2009
## [1] "MSE of fold  5  :"
## [1] 323.8388
## [1] "MAE of fold  5  :"
## [1] 13.18518
## [1] "MSE of fold  6  :"
## [1] 380.3584
## [1] "MAE of fold  6  :"
## [1] 14.74062
## [1] "MSE of fold  7  :"
## [1] 340.5263
## [1] "MAE of fold  7  :"
## [1] 13.86783
## [1] "MSE of fold  8  :"
## [1] 454.5657
## [1] "MAE of fold  8  :"
## [1] 14.75693
## [1] "MSE of fold  9  :"
## [1] 412.1573
## [1] "MAE of fold  9  :"
## [1] 15.25759
## [1] "MSE of fold  10  :"
## [1] 365.3386
## [1] "MAE of fold  10  :"
## [1] 13.82159
print("========================================")
## [1] "========================================"
print("FINAL MSE")
## [1] "FINAL MSE"
print(error/nrow(cancer.dataset.filled))
## [1] 375.5833
print("FINAL RMSE")
## [1] "FINAL RMSE"
print(sqrt(error/nrow(cancer.dataset.filled)))
## [1] 19.37997
print("FINAL MAE")
## [1] "FINAL MAE"
print(error_abs/nrow(cancer.dataset.filled))
## [1] 14.31538

This shows that the previous (training-set) values were slightly optimistic, as expected. The error is nonetheless not that large when compared with the distribution of the response variable:

ggplot(cancer.dataset.filled) +
  geom_histogram(aes(x=TARGET_deathRate), fill="white", color="black") +
  labs(title = "Death rate distribution")

We have shown how linear regression, combined with the lasso and backward feature selection, can be used to build a model that predicts county-level cancer death rates from economic and demographic data.